A Problem-Specific Fault-Tolerance Mechanism for Asynchronous, Distributed Systems

Authors

  • Adriana Iamnitchi
  • Ian T. Foster
Abstract

The idle computers on a local area, campus area, or even wide area network represent a significant computational resource, but one that is also unreliable, heterogeneous, and opportunistic. We describe an algorithm that allows branch-and-bound problems to be solved in such environments. In designing this algorithm, we faced two challenges: (1) scalability, to effectively exploit the variably sized pools of resources available, and (2) fault tolerance, to ensure the reliability of services. We achieve scalability through a fully decentralized algorithm, in which the dynamically available resources are managed through a membership protocol. We guarantee fault tolerance in the sense that the loss of up to all but one resource will not affect the quality of the solution. For propagating information reliably, we use epidemic communication for both the membership protocol and the fault-tolerance mechanism. We have developed a simulation framework that allows us to evaluate design alternatives. Results obtained in this framework suggest that our techniques can execute scalably and reli-


Related resources

A Case Study of Agreement Problems in Distributed Systems: Non-Blocking Atomic Commitment

This paper considers an agreement problem whose practical interest is well known, namely the Non-Blocking Atomic Commitment Problem. First, a generic protocol solving this problem is given and then instantiations of its generic statements are provided for both synchronous and asynchronous distributed systems. These instantiations use a few basic components: timeout mechanism and reliable multic...


Consensus in Asynchronous Distributed Systems

The distributed consensus problem arises when several processes need to reach a common decision despite failures. The importance of this problem is due to its omnipresence in distributed computation: we need consensus to implement reliable communications, atomic commitment, consistency checks, resources allocations etc. The solvability of this problem is strictly related to the nature of the sy...


Revisiting the Non-Blocking Atomic Commitment Problem in Distributed Systems

Agreement problems allow a set of processes to agree on a common output value. These problems are of primary importance in distributed systems and difficult to solve in the presence of failures. This paper considers one of these problems whose practical interest is well known, namely the Non-Blocking Atomic Commitment Problem. First, a generic protocol solving this problem is given and then instantia...


Fault-Tolerant Distributed Systems: a Modular Approach to the Non-Blocking Atomic Commitment Problem

Agreement problems allow a set of processes to agree on a common output value. These problems are of primary importance in distributed systems and difficult to solve in the presence of failures. This paper considers one of these problems whose practical interest is well known, namely the Non-Blocking Atomic Commitment Problem. First, a generic protocol solving this problem is given and then instantiat...


Solving Consensus in a Byzantine Environment Using an Unreliable Fault Detector

Unreliable fault detectors can be used to solve the consensus problem in asynchronous distributed systems that are subject to crash faults. We extend this result to asynchronous distributed systems that are subject to Byzantine faults. We define the class ◇S(Byz) of eventually strong Byzantine fault detectors and the class ◇W(Byz) of eventually weak Byzantine fault detectors and show that any B...



Publication date: 2000